Task 3, 4, and 7#8

Merged
lfoppiano merged 30 commits into main from luca/feature/part3
Jan 16, 2026

@lfoppiano (Collaborator) commented Dec 28, 2025

Description

This PR implements tasks 3 and 4 (and 7, which requires no changes). It relies mostly on JWARC 0.33 (released before Christmas), which introduces support for CDXJ. WET and WAT files, however, are processed with quick custom code that extends CdxWriter.java so that the record types selected by the user are exported (e.g. --records conversion selects the conversion records and outputs them as CDXJ).
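
To illustrate the idea of exporting only user-selected record types as CDXJ, here is a minimal, self-contained sketch. It is not the code in this PR and does not use the JWARC API: `CdxjSketch`, `Rec`, `cdxjLine`, and `export` are hypothetical names, and the JSON fields (`filename`, `offset`, `length`) are an assumed, illustrative subset of what a CDXJ line may carry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: filter records by WARC-Type and emit one CDXJ line each.
final class CdxjSketch {

    // Simplified stand-in for a WARC record header; NOT a JWARC class.
    static final class Rec {
        final String type;      // e.g. "response", "conversion", "metadata"
        final String surtKey;   // SURT-form URL key
        final String timestamp; // 14-digit capture timestamp
        final String filename;
        final long offset;
        final long length;

        Rec(String type, String surtKey, String timestamp,
            String filename, long offset, long length) {
            this.type = type;
            this.surtKey = surtKey;
            this.timestamp = timestamp;
            this.filename = filename;
            this.offset = offset;
            this.length = length;
        }
    }

    // Formats "<key> <timestamp> <json>". No JSON escaping is done here,
    // so this assumes the values contain no quotes or backslashes.
    static String cdxjLine(Rec r) {
        return String.format(
            "%s %s {\"filename\": \"%s\", \"offset\": \"%d\", \"length\": \"%d\"}",
            r.surtKey, r.timestamp, r.filename, r.offset, r.length);
    }

    // Keeps only records whose type was selected (as "--records conversion" would).
    static List<String> export(List<Rec> records, Set<String> selectedTypes) {
        List<String> lines = new ArrayList<>();
        for (Rec r : records) {
            if (selectedTypes.contains(r.type)) {
                lines.add(cdxjLine(r));
            }
        }
        return lines;
    }
}
```

With a selection of `conversion`, only the WET-style conversion records would produce index lines; all other record types are skipped.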

Notes & open questions

I left some TODOs, since with more time the code could be integrated into JWARC.

@lfoppiano lfoppiano marked this pull request as ready for review January 5, 2026 14:43
@lfoppiano lfoppiano changed the title Task 3 and 4 Task 3, 4, and 7 Jan 5, 2026

@sebastian-nagel sebastian-nagel left a comment

Looks good.

Comment thread Makefile
Comment thread Makefile
Comment thread Makefile Outdated
Comment thread Makefile Outdated
# Conflicts:
#	Makefile
#	README.md
#	src/main/java/org/commoncrawl/whirlwind/ValidateWARC.java
@lfoppiano lfoppiano merged commit 4c97de4 into main Jan 16, 2026
1 check passed
@lfoppiano lfoppiano deleted the luca/feature/part3 branch January 16, 2026 17:05
2 participants